Add optional MLflow logging to the cross-validation CLI #407
Pull request overview
Adds opt-in MLflow tracking to the jabs-cli cross-validation workflow so cross-validation runs can be logged (metrics/params/tags + optional report artifact) and compared over time, while keeping MLflow as a fully optional dependency.
Changes:
- Introduces `jabs.classifier.mlflow_logging` with helpers to aggregate CV metrics, parse tags, load `MLFLOW_*` env files, resolve experiment names, and push a single MLflow run per invocation.
- Extends the `cross-validation` CLI with `--mlflow [ENV_FILE]`, `--mlflow-experiment`, `--mlflow-tag`, and `--mlflow-no-report`, plus a distinct exit code (3) for MLflow push failures.
- Adds unit tests for the logging module (with a fake injected `mlflow`) and CLI option parsing; updates docs in both the online and in-app copies.
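To make the tag-parsing and env-file helpers concrete, here is a minimal sketch of what such input handling could look like. It is not the PR's code: `parse_tags` and `read_mlflow_env` are hypothetical stand-ins for `parse_kv_tags` and `load_env_file`, and their signatures, quoting rules, and error types are assumptions.

```python
import os
from pathlib import Path


def parse_tags(pairs):
    """Turn ["purpose=baseline", ...] into a dict (hypothetical helper)."""
    tags = {}
    for item in pairs:
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            # Assumption: malformed items are rejected rather than ignored.
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        tags[key] = value.strip()
    return tags


def read_mlflow_env(path, environ=None):
    """Copy MLFLOW_* assignments from a simple .env file into `environ`.

    Assumption: one KEY=VALUE per line, '#' starts a comment line, and
    surrounding quotes are dropped. Keys without the MLFLOW_ prefix are skipped.
    """
    environ = os.environ if environ is None else environ
    env_path = Path(path).expanduser()  # expand a leading "~"
    if not env_path.is_file():
        raise FileNotFoundError(f"env file not found: {env_path}")
    loaded = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("MLFLOW_"):
            loaded[key] = value.strip().strip("'\"")
    environ.update(loaded)
    return loaded
```

Passing a plain dict as `environ` keeps the sketch testable without touching the real process environment.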
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| uv.lock | Adds mlflow as an optional extra in the lock metadata. |
| pyproject.toml | Declares the mlflow optional dependency extra. |
| `src/jabs/classifier/__init__.py` | Re-exports MLflow helpers and error type from the classifier package. |
| src/jabs/classifier/mlflow_logging.py | New module implementing MLflow availability checks, env loading, aggregation, tagging, and run/artifact logging. |
| src/jabs/scripts/cli/cli.py | Adds MLflow-related CLI options and wiring into run_cross_validation, including exit code mapping. |
| src/jabs/scripts/cli/cross_validation.py | Adds MLflow logging after report save, and raises MlflowLoggingError on push failure. |
| tests/classifier/test_mlflow_logging.py | New tests for metrics aggregation, tag parsing, env file loading, experiment selection, and logging behavior via fake MLflow. |
| tests/scripts/test_cross_validation_cli.py | New tests for MLflow option parsing/forwarding and exit code mapping. |
| docs/user-guide/cli-tools.md | Documents the cross-validation command and MLflow integration (online docs). |
| src/jabs/resources/docs/user_guide/cli-tools.md | Mirrors the same CLI + MLflow documentation for the in-app docs. |
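The exit-code wiring noted for `cli.py` and `cross_validation.py` might look roughly like the sketch below. It is not the PR's code: `run`, the callables passed to it, and the constant names are hypothetical, and `MlflowLoggingError` here is a local stand-in for the real error type. Only the exit values (1 when the extra is missing, 3 for a push failure) come from the PR.

```python
import importlib.util
import sys

EXIT_OK = 0
EXIT_ERROR = 1               # generic failure, including a missing mlflow extra
EXIT_MLFLOW_PUSH_FAILED = 3  # cross-validation ran, but the MLflow push failed


class MlflowLoggingError(RuntimeError):
    """Local stand-in for the PR's error type."""


def mlflow_available():
    # find_spec locates the package without importing it, so the base
    # install never pays for (or requires) an mlflow import.
    return importlib.util.find_spec("mlflow") is not None


def run(use_mlflow, cross_validate, push, available=mlflow_available):
    """Hypothetical driver: fail fast, run CV, then push if requested."""
    if use_mlflow and not available():
        print("error: --mlflow requires the 'mlflow' extra", file=sys.stderr)
        return EXIT_ERROR  # nothing has been computed yet
    report = cross_validate()
    if use_mlflow:
        try:
            push(report)
        except MlflowLoggingError as exc:
            print(f"error: MLflow push failed: {exc}", file=sys.stderr)
            return EXIT_MLFLOW_PUSH_FAILED
    return EXIT_OK
```

Keeping the availability check ahead of `cross_validate()` is what makes the missing-extra case cheap: the command stops before any training work is done.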
Resolved in 7c651d8 (docstring) and the PR description has been updated.
Implementation, tests, docs (both copies), docstring, and PR description are now consistent.
Both Copilot review comments are resolved in f8dfadd.
Added |
Summary
Adds opt-in MLflow tracking to the `jabs-cli cross-validation` command. Each run can record aggregate cross-validation metrics, run parameters, descriptive tags, and the training report as an artifact, so cross-validation runs of a behavior can be compared over time. MLflow is a fully optional dependency: the base install and all existing behavior are unchanged when it isn't used.

Tracks KLAUS-444.
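As an illustration of what "aggregate cross-validation metrics" can mean in practice, the sketch below folds per-iteration results into mean/std values with `cv_`-prefixed names. It is only a sketch: `aggregate_folds`, the per-fold key names, the `cv_iterations` key, and the use of population standard deviation are all assumptions, not the behavior of the PR's `aggregate_cv_metrics`.

```python
from statistics import fmean, pstdev


def aggregate_folds(folds):
    """Summarize per-fold metric dicts as cv_<name>_mean / cv_<name>_std.

    Assumption: every fold dict has the same numeric keys.
    """
    if not folds:
        raise ValueError("need at least one cross-validation iteration")
    summary = {"cv_iterations": float(len(folds))}  # hypothetical key name
    for name in sorted(folds[0]):
        values = [fold[name] for fold in folds]
        summary[f"cv_{name}_mean"] = fmean(values)
        # Population std is an arbitrary choice for this sketch.
        summary[f"cv_{name}_std"] = pstdev(values)
    return summary
```

A flat dict of floats like this is the shape that `mlflow.log_metrics` accepts, which is what would make the runs table sortable by a single column.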
What's included
- New module `jabs.classifier.mlflow_logging`
  - `log_cross_validation_to_mlflow(...)`: creates one MLflow run, logs metrics/params/tags + the report artifact, returns `(run_id, tracking_uri)`.
  - `aggregate_cv_metrics`, `build_params`, `build_tags`, `resolve_experiment_name`, `parse_kv_tags`, `load_env_file`, `mlflow_available`, and `MlflowLoggingError`.
  - `import mlflow` is lazy (inside the logging function only), so the base package never depends on it.
- Optional dependency: new `mlflow` extra, installed with `pip install 'jabs-behavior-classifier[mlflow]'`.
- CLI options on `cross-validation`
  - `--mlflow [ENV_FILE]`: enable logging; optional `.env` file with `MLFLOW_*` connection settings (ambient env if omitted).
  - `--mlflow-experiment NAME`: override the experiment (see below).
  - `--mlflow-tag KEY=VALUE`: repeatable free-form run tags.
  - `--mlflow-no-report`: skip the report artifact (metrics + params only).
- Per-behavior experiments: runs default to experiment `jabs-<behavior>` so a behavior's runs form their own leaderboard (runs of different behaviors aren't comparable). Precedence: `--mlflow-experiment` → `MLFLOW_EXPERIMENT_NAME` → `jabs-<behavior>`. The experiment is auto-created.
- Leaderboard metrics: `cv_f1_behavior_mean`, `cv_accuracy_mean`, precision/recall (mean + std), iteration count, and dataset composition are logged as MLflow metrics, so the experiment's runs table is sortable by mean F1. Full per-fold detail rides along as the report artifact.
- Exit codes & failure handling
  - `--mlflow` requested but the `mlflow` extra not installed → fail fast with an error before running cross-validation, exit `1`. Logging was explicitly requested but can't be honored, so the command stops rather than silently producing a run with no logging; install the extra (or drop `--mlflow`) and re-run.
  - `--mlflow ENV_FILE` path doesn't exist → fail fast before running cross-validation, exit `1`. The env-file path is validated up front (with a leading `~` expanded), so a typo is caught immediately rather than after the run.
  - MLflow push failure → exit `3` (distinct from the generic `1`).
- Docs: both copies (online + in-app `cli-tools.md`) gain a `jabs-cli cross-validation` command section and a detailed MLflow integration section (install, enabling, connection config, experiment selection, leaderboard, tags, exit codes).

Example

    jabs-cli cross-validation /path/to/project --behavior grooming \
      --mlflow settings.env --mlflow-tag purpose=baseline
    # -> logs to experiment "jabs-grooming"

Testing
- `tests/classifier/test_mlflow_logging.py` (logging module, with a fake `mlflow` injected, so no server or network is needed) and MLflow CLI option-parsing tests in `tests/scripts/test_cross_validation_cli.py`.
- `tests/classifier/` + `tests/scripts/`: 305 passed.
- `ruff check` / `format` clean.
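The experiment precedence described above (`--mlflow-experiment`, then `MLFLOW_EXPERIMENT_NAME`, then `jabs-<behavior>`) can be sketched in a few lines. This is an illustrative stand-in, not the PR's `resolve_experiment_name`: the function name, its parameters, and the treatment of empty strings as "unset" are assumptions.

```python
import os


def pick_experiment(behavior, cli_value=None, environ=None):
    """Return the experiment name using CLI > env var > per-behavior default."""
    environ = os.environ if environ is None else environ
    if cli_value:  # explicit --mlflow-experiment wins
        return cli_value
    env_value = environ.get("MLFLOW_EXPERIMENT_NAME")
    if env_value:  # assumption: an empty value counts as unset
        return env_value
    return f"jabs-{behavior}"  # default keeps each behavior's runs together


# The example command in this PR passes no override, so with no env var set:
print(pick_experiment("grooming", environ={}))  # jabs-grooming
```

One way to get the auto-creation behavior is `mlflow.set_experiment(name)`, which creates the experiment if it does not already exist; the PR description does not say which call it actually uses.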